A small corpus of Nluu
Nluu is an endangered language of the Tuu language family from Africa. The language is no longer actively spoken, since its handful of speakers (2 as of 2020) live in different villages and have no direct contact with each other. Nluu has one of the most complex sound inventories in the world, with over 100 different phonemes, 45 of which are clicks.
A transcribed recording session of Nluu was kindly provided to us by Alena Witzlack-Makarevich (Hebrew University of Jerusalem). I have converted this corpus to a CSV (comma-separated values) file that can be easily imported into R.
Let’s load the corpus and inspect it!
library(tidyverse)  # provides read_csv(), mutate(), case_when() and the pipe
corpus <- read_csv("nuu.csv", col_types = cols())
corpus
Inspect the corpus data and describe the meaning of the individual columns as well as the structural relationships between them. What are the objects of description in this dataset? Sketch a simple entity-relationship diagram for this dataset and insert it as an image.

Inspecting individual entries
Extract the 24th utterance and count its elements (words and morphemes). Write down this utterance as a glossed example.
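One way to approach this task is sketched below on a toy table rather than on the real corpus. The sketch assumes that the corpus has one row per morpheme and that the columns are called `record.id`, `word`, `morpheme.id` and `gloss`; check these assumptions against the actual column names. All forms and glosses in `toy` are invented placeholders, not Nluu data.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the corpus, assuming one row per morpheme.
# Forms and glosses are invented placeholders, not Nluu data.
toy <- tribble(
  ~record.id, ~word,  ~morpheme.id, ~gloss,
  1,          "ta",   "1.1",        "PRO",
  1,          "kela", "1.2",        "see",
  1,          "kela", "1.3",        "PST",
  2,          "mo",   "2.1",        "dog",
  2,          "susi", "2.2",        "run",
  2,          "susi", "2.3",        "PROG",
  2,          "na",   "2.4",        "PART"
)

# Keep only the rows that belong to one utterance (here record 2).
utt <- filter(toy, record.id == 2)

# Count distinct word forms and distinct morphemes in that utterance.
n_words     <- n_distinct(utt$word)
n_morphemes <- n_distinct(utt$morpheme.id)

# Build the gloss line of a glossed example: one entry per word,
# with the glosses of its morphemes joined by hyphens.
# Grouping by a factor keeps the words in their order of appearance.
gloss_lines <- utt %>%
  group_by(word = factor(word, levels = unique(word))) %>%
  summarise(gloss = paste(gloss, collapse = "-"), .groups = "drop")

gloss_lines
```

Note that `n_distinct(utt$word)` counts word types: if the same word form occurs twice in one utterance, it is counted once. If the real corpus has a word identifier column, counting that column instead would give the number of word tokens.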
Part of speech exploration
What are the 5 most frequent part-of-speech types in the corpus?
sort(table(corpus$ps),decreasing = TRUE)[1:5]
 pro    n part  vtr vitr
5955 4668 4354 3135 2505
Find the 10 most frequent nouns.
n_list <- corpus %>% filter(ps == "n")  # keep only the rows tagged as nouns
sort(table(n_list$word), decreasing = TRUE)[1:10]
      ki ǀʼhuunsi    gǁain      gao    tyuin     ǀoba    ǃqhaa
     155      120       86       59       57       50       46
     ǂoo    nǁang   ǀʼhuun
      45       43       43
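The same kind of frequency list can also be produced with dplyr verbs, which returns a regular table that is easier to filter, join or plot later. The following is a minimal sketch on invented toy data; only the column names `word` and `ps` are taken from the corpus, everything else is a placeholder.

```r
library(dplyr)
library(tibble)

# Toy data: invented word forms with part-of-speech tags.
toy <- tribble(
  ~word, ~ps,
  "mo",  "n",
  "ka",  "n",
  "mo",  "n",
  "su",  "vitr",
  "li",  "n",
  "mo",  "n",
  "ka",  "n",
  "ta",  "pro"
)

# Keep the nouns, count each word form, sort by frequency,
# and keep the two most frequent entries.
top_nouns <- toy %>%
  filter(ps == "n") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 2)

top_nouns
```

On the real corpus, replacing `toy` with `corpus` and `n = 2` with `n = 10` should give the ten most frequent nouns, with the counts in a column called `n`.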
Add a new column to the dataset, stem_lexical_type. It should be “pro” for pronouns, “n” for nouns, “v” for verbs and “other” for everything else.
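# Collapse the part-of-speech tags into four coarse lexical types.
corpus <- mutate(corpus, stem_lexical_type = case_when(
  ps == "pro" ~ "pro",
  ps == "n" ~ "n",
  ps %in% c("vtr", "vitr", "vatr") ~ "v",  # all verb tags map to "v"
  TRUE ~ "other"                           # fallback, named as in the task
))
corpus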
TODO (advanced, only if you want a challenge!): Now something more difficult: do the same as above, but at the word level rather than the stem level. Call this column word_lexical_type.
Noun to verb ratio
The noun-to-verb ratio is a number that tells us how many nominal elements speakers produce in relation to verbal elements, or roughly, how many nouns there are per sentence. It is a measure of referential density: how much specific information speakers tend to give per unit of event or state description. In some languages/cultures speakers are expected to be rather detailed, which yields a high noun-to-verb ratio (e.g. ‘A girl wearing a flowery bonnet entered the bookstore with large windows and smiled at the old bookkeeper who was just enjoying his morning coffee’), whereas in other languages the same information could be conveyed as ‘came in, smiled’.
The noun-to-verb ratio is computed as follows (the \(+1\) in the denominator prevents division by zero):
\[
N2V = \frac{n(Noun)}{n(Verb)+1}
\]
Nouns here may or may not include other elements such as pronouns or demonstrative pronouns. We compute N2V per utterance.
Compute the N2V for all the utterances in the corpus, once taking both pronouns and nouns into account and once taking only nouns into account. The output should be a table with columns record.id, speaker, N2V.NPro and N2V.N, with one row per utterance.
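The computation can be sketched on a small toy table before running it on the corpus. The sketch assumes one row per stem, the `stem_lexical_type` column created earlier, and columns named `record.id` and `speaker`; the rows themselves are invented for illustration. The helper column `n_verbs` is a hypothetical name used only inside the summary.

```r
library(dplyr)
library(tibble)

# Toy data: three utterances by two speakers, one row per stem.
toy <- tribble(
  ~record.id, ~speaker, ~stem_lexical_type,
  1,          "A",      "pro",
  1,          "A",      "v",
  1,          "A",      "n",
  1,          "A",      "other",
  2,          "A",      "n",
  2,          "A",      "n",
  3,          "B",      "pro",
  3,          "B",      "v",
  3,          "B",      "v",
  3,          "B",      "n"
)

# One row per utterance: count the nominal elements and divide by
# the number of verbs plus one, as in the N2V formula.
n2v <- toy %>%
  group_by(record.id, speaker) %>%
  summarise(
    n_verbs  = sum(stem_lexical_type == "v"),
    N2V.NPro = sum(stem_lexical_type %in% c("n", "pro")) / (n_verbs + 1),
    N2V.N    = sum(stem_lexical_type == "n") / (n_verbs + 1),
    .groups  = "drop"
  ) %>%
  select(record.id, speaker, N2V.NPro, N2V.N)

n2v
```

Utterance 2 contains no verb at all, so without the \(+1\) its ratio would be undefined; with it, both ratios for that utterance come out as 2.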